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One mechanism by which disease-associated DNA variation can alter disease risk is altering gene expression. 
However, linkage disequilibrium (LD) between variants, mostly single-nucleotide polymorphisms (SNPs), 
means it is not sufficient to show that a particular variant associates with both disease and expression, as 
there could be two distinct causal variants in LD. Here, we describe a formal statistical test of colocalization 
and apply it to type 1 diabetes (T1 D)-associated regions identified mostly through genome-wide association 
studies and expression quantitative trait loci (eQTLs) discovered in a recently determined large monocyte ex- 
pression data set from the Gutenberg Health Study (1370 individuals), with confirmation sought in an additional 
data set from the Cardiogenics Transcriptome Study (558 individuals). We excluded 39 out of 60 overlapping 
eQTLs in 49 T1D regions from possible colocalization and identified 21 coincident eQTLs, representing 21 
genes in 14 distinct T1D regions. Our results reflect the importance of monocyte (and their derivatives, macro- 
phage and dendritic cell) gene expression in human T1 D and support the candidacy of several genes as causal 
factors in autoimmune pancreatic beta-cell destruction, including AFF3, CD226, CLECL1, DEXI, FKRP, PRKD2, 
RNLS, SMARCE1 and SUOX, in addition to the recently described GPR183 (EBI2) gene. 



INTRODUCTION 

Genome-wide association studies (GWAS) have identified 
multiple markers, usually single-nucleotide polymorphisms 
(SNPs), associated with risk of common diseases, and atten- 
tion has now turned to explaining the underlying molecular 
mechanisms. Currently, a common hypothesis is that a propor- 
tion of the causal variants tagged by these disease-associated 
markers may affect the abundance of a protein or the relative 
abundance of its different isoforms by altering transcription 



and/or splicing. Indeed, researchers have identified several 
SNPs associated with both a disease and expression of a 
nearby gene and proposed that this reflects a common causal 
molecular mechanism (1). However, linkage disequilibrium 
(LD) between variants means it is possible that the two 
traits, disease susceptibility and gene expression, are affected 
by distinct causal variants in LD. For example, association of 
the same SNPs with both type 1 diabetes (T1D) and RPS26 ex- 
pression in lymphoblastoid cell lines was previously inter- 
preted to imply that RPS26 was the T1D causal gene in the 
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region (2). However, a formal statistical analysis demonstrated 
that it was considerably more likely that two distinct causal 
variants existed, one underlying each trait (3). 

T1D has now been associated with 53 loci across the human 
genome (4-6). Although we, and others, have named attractive 
candidate genes in >60% of these regions (6), the evidence 
from direct functional studies supporting causality is often 
limited. Many T1D loci will overlap with expression quantita- 
tive trait loci (eQTLs), either by chance or due to common 
mechanism, and therefore will contain SNPs associated with ex- 
pression of nearby or distant genes. Together with other func- 
tional evidence and results such as animal model data, such 
observations have led to the localization of causal genes and 
pathways, improving knowledge of the aetiology of this multi- 
factorial disease. For example, the correlation between INS 
alleles and INS expression in human thymus (7) and correlations 
between IL2RA SNPs and levels of RNA and protein (8) have 
led to the identification of INS and IL2RA as causal for T1D. 
Statistical evidence that the T1D and expression signals coloca- 
lize, i.e. are compatible with the hypothesis of a common causal 
variant, would help prioritize a particular gene as potentially 
causal in T1D and justify further exploration of the relevant 
physiological pathway. 

As gene expression and eQTLs may be tissue specific 
(8—10), it is important to study disease-relevant tissues. T1D 
is very strongly associated with functional amino acid poly- 
morphisms of the antigen-presenting HLA class II molecules 
(1 1), and one of the relevant cell types in T1D are monocytes, 
which are the circulating precursors of the major antigen- 
presenting cells in the immune system, dendritic cells and 
macrophages. The T1D susceptibility gene, GPRI83, is asso- 
ciated with monocyte gene expression (12); macrophages are 
evident in the autoimmune infiltrate of pancreatic islets in 
histological analysis of autopsy tissue from patients with 
T1D (13); and their blockade reduces T1D frequency in non- 
obese diabetic mice (14). The Gutenberg Health Study (GHS) 
(15) has measured gene expression in fresh (purified) mono- 
cytes in 1490 (1370 non-diabetic) subjects and, as such, is 
the largest available data set for conducting colocalization 
analysis with T1D (12). 

In order to conduct a formal colocalization analysis across all 
known T1D loci, we first considered an alternative derivation of 
the statistical test presented by Plagnol et al. (3). The original 
test relied on standard statistical asymptotics which may not 
hold, because the likelihood is commonly bimodal and rarely 
quadratic near a maximum (Supplementary Material, Fig. SI). 
Here, we present this statistical method and use it to formally 
test for colocalization between eQTLs in the GHS and T1D 
signals across 49 regions outside of HLA known to be asso- 
ciated with T1D (6). We use an additional monocyte expression 
data set from the Cardiogenics Transcriptome Study (CTS) to 
seek confirmation. Our results, identifying genes whose eQTL 
signals colocalize with T1D signals, can be used to direct 
detailed future study of certain T1D loci. 

RESULTS 

We generated a comprehensive map of both cis and trans 
monocyte eQTL patterns found in 1370 non-diabetic subjects 



from the GHS across 49 associated T1D loci listed in 
Tl DBase (6) (Supplementary Material, Table SI). The 49 
regions in total comprise 19 Mb. The HLA region was 
excluded from analysis as the complex pattern of LD, which 
differs between cases and controls, would violate one of the 
assumptions of the test — that LD does not differ between 
cohorts. We identified a total of 60 genotype-probe expression 
associations with P < 10" 8 (53 cis effects) or P < 10" 10 (7 
trans effects) in the GHS data set (Supplementary Material, 
Table S2). Fifty of these probes were also available in the 
CTS and all showed normalized fold changes in the same dir- 
ection in the two data sets. 

There are a number of differences between the GHS and 
CTS data sets; chief among them, the GHS is a cohort study 
which used negative selection to isolate monocytes, whereas 
the CTS is a study of coronary artery disease (CAD) and myo- 
cardial infarction (MI) cases and controls which used positive 
selection. Either case status or positive selection, which can 
activate cells, may create differences in expression and 
hence in eQTLs. For this reason, we took a cautious approach 
to the inclusion of the CTS data, testing first for a significant 
eQTL effect in the CTS data, and second for evidence of colo- 
calization of the GHS and CTS, only including the CTS data 
when there was no evidence against colocalization at a conser- 
vative threshold of P > 0.01. 

It is important to note that our statistical test is constructed 
with a null hypothesis of colocalization. Thus, small P-values 
allow us to reject the null and to conclude it is unlikely that 
disease susceptibility and gene expression share the same 
underlying causal variant, i.e. unlikely that T1D association 
in a region is mediated by monocyte expression differences 
of the gene under test. However, larger P-values could corres- 
pond either to genuine colocalization or failure to reject the 
null due to insufficient statistical power. For this reason, we 
present our complete results sorted by an overall P-value (Sup- 
plementary Material, Table S3). We could exclude colocaliza- 
tion of T1D and monocyte expression signals for probes in 39 
genes (P < 0.0008; Bonferroni correction of a = 0.05 for 60 
tests), including RPS26. This left 21 probes for which we 
cannot exclude colocalization (Table 1), which are now 
worthy of follow-up. These include one potential trans 
effect at the T1D locus 12ql3.2 corresponding to a probe in 
DCAF16 on chromosome 4pl5 (6), a region which does not 
contain any known T ID-associated SNPs, with the remainder 
acting in cis. 

The test statistic from our alternatively derived colocaliza- 
tion test is, in fact, identical to that from Plagnol et a/.'s (3) 
asymptotic derivation (see Supplementary Material), but infer- 
ence is clearer under our alternative derivation, where we use 
posterior predictive P- values (16,17) to evaluate significance. 
We find that posterior predictive P-values are very close to 
asymptotic P-values (Fig. 1), suggesting that our concerns 
about the asymptotics were unfounded, in data sets of this 
size at least. In smaller sample sizes, the new approach 
could still be preferred. 

Finally, one region included here, on chromosome 2ql 1 .2, 
has been associated with rheumatoid arthritis (18-20) and ju- 
venile idiopathic arthritis (21) and is known to function in the 
immune system. It has been assigned previously as a T1D 
locus (22) based on its association with the other autoimmune 
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Table 1. Twenty-one probes which are consistent with colocalization of monocyte expression signals and T1D association (P > 0.0008) 



Region 


Probe 




Gene 


GHS-CTS 


r 2 


Sign (17) 


P-value 


2qll.2 


ILMN 


.1775235 


AFF3 


No 


0.030 


+ 


0.082 


6q25.3 


ILMN_ 


.1795937 


EZR 


Yes 


0.027 


— 


0.00486 


6q25.3 


ILMN_ 


.1788223 


RSPH3 


Yes 


0.359 


+ 


0.196 


10q23.31 


ILMN_ 


.1718520 


RNLS 


Yes 


0.043 


+ 


0.252 


12pl3.31 


ILMN_ 


.1782729 


CLECLl 


No 


0.711 


+ 


0.0804 


12ql3.2 


ILMN_ 


.1753440 


DCAF16 


Yes 


0.122 


+ 


0.109 


12ql3.2 


ILMN_ 


.1803745 


SUOX 


Yes 


0.062 


— 


0.00171 


12ql3.3 


ILMN_ 


.1725079 


TSPAN31 


t 


0.041 


+ 


0.0238 


12ql3.3 


ILMN_ 


.1723846 


FAM119B 


Yes 


0.518 


— 


0.0014 


12ql3.3 


ILMN_ 


.2097954 


TSFM 


t 


0.058 


+ 


0.0027 


13q32 


ILMN_ 


.2168217 


GPR183 


Yes 


0.043 




0.214 


16pll.2 


ILMN_ 


.1701477 


CCDC101 


Yes 


0.117 


+ 


0.135 


16pl3.13 


ILMN_ 


.1738866 


DEXI 


Yes 


0.132 




0.00245 


16pl3.13 


ILMN_ 


.1655244 


LOC642755 




0.131 




0.0010 


17q21.2 


ILMN_ 


.1747857 


SMARCEl 


No 


0.260 




0.139 


18q22.2 


ILMN_ 


1687825 


CD226 


t 


0.047 






19ql3.2 


ILMN_ 


.1681296 


ICAM4 


Yes 


0.142 




0.00145 


19ql3.2 


ILMN_ 


.2212763 


ICAM3 


Yes 


0.047 


+ 


0.00195 


19ql3.32 


ILMN_ 


.1753805 


PRKD2 


Yes 


0.089 




0.114 


19ql3.32 


ILMN_ 


.2368617 


FKRP 


Yes 


0.085 




0.00732 


Xq28 


ILMN_ 


.1808356 


FAM3A 


Yes 


0.185 


+ 


0.0101 



GHS-CTS indicates whether the CPG signal colocalizes with GHS ('Yes' or 'No'); ' — ' indicates cases where probe was not present in the CTS and J where the 
CTS effect was not significant, r 2 shows the proportion of variance in expression explained by the best SNP(s) in the GHS data set. Sign (17) indicates whether 17 is 
positive or negative, i.e. whether increased expression correlates with T1D susceptibility (' + ') or protection (' — ') in GHS versus WTCCC. ' A ' indicates cases 
where only one SNP is required to capture both the eQTL and T1D signal, i.e. where the data are consistent with the null and a formal colocalization test is neither 
needed nor possible. 

We have prepared an R (23) package, coloc, which imple- 
ments the tests described here and is available from CRAN 
(http ://cran.r-project.org). 

DISCUSSION 

The close correspondence between the asymptotic and the pos- 
terior predictive P-values suggests that our caution regarding 
whether asymptotics would hold for non-quadratic, bimodal 
likelihood was overstated. However, we show (Supplementary 
Material, Fig. S2) that, with smaller sample sizes, this is gen- 
erally not the case and we recommend that both tests are con- 
sidered, rather than assuming the asymptotic theory will 
always hold. 

Previous colocalization analyses have taken a conditional 
approach, considering the degree to which each expression 
trait was explained by the most strongly disease-associated 
SNP in a region [e.g. using lymphoblastoid cell lines expres- 
sion data (24)] or examining by eye the similarity between 
the association profiles [e.g. comparing expression from 
whole blood and celiac disease association (1)]. Conditional 
approaches tend to begin with the most strongly associated 
SNP in each data set and then consider how the evidence for 
association in one data set changes when the best SNP from 
the other data set is included in the model. Methods which 
depend on single-SNP analyses may fail to capture the com- 
plexity of the data when either multiple signals exist or mul- 
tiple SNPs are needed to explain a signal. Even where a 
single SNP is sufficient, if the most associated SNP is not 
the causal variant, as is commonly the case in existing 
GWAS, we would expect some residual association to 
remain even in the case where the two traits share a causal 




asymptotic p 



Figure 1. Comparison of asymptotic and posterior predictive P-values, logio 
scale. 

diseases and on evidence of association of SNPs in the region 
with T1D at a threshold of P < 10~ 4 . In the present study, we 
present stronger evidence that the 2ql 1.2 locus is significantly 
associated with T1D risk (P= 8.95 x 10~ 8 ; Table 2), using 
additional genotyping and samples. We find no evidence 
against colocalization of increased monocyte AFF3 expression 
with T1D risk in this region, suggesting AFF3 as the potential 
causal gene in the region. 
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Table 2. Association of rs4851256 C >T/rs4851253 T > G with T1D in case, control and family samples 



Cohort 


N 


MAF 


OR/RR (minor) 


95% CI 


/"-value 


Barrett et al. (4): rs4851256 C >T 














4913/7341 


0.3 84 a 


0.86 b 


0.79-0.94 b 


8.6 x 10~ 7 


Replication: rs4851253 T >G 












Families 


3411 


0.376 


0.93 


0.87-1.00 


0.0489 


Replication case-control 


4215/4428 


0.366 


0.93 


0.88-0.99 


0.0317 


Combined (Mantel-Haenszel method) 










0.00358 


Combined result (Fisher's method) 










8.95 x 10~ 8 



rs4851256 was the most associated SNP in the AFF3/2qil.2 region in the GWAS meta-analysis (4), but we could not design a TaqMan assay to genotype the 
replication resource. In the present study, a proxy was selected, rs4851253, which has r 2 = 1 with rs4851256 in the HapMap CEU samples. N is the number of 
cases/controls or the number of informative transmissions for family data. MAF, frequency of the minor allele (the minor allele is listed as the second allele after 
the rsnumber) in controls or parents; OR, odds ratio; CI, confidence interval. 

a Estimated from 3342 controls genotyped directly by the WTCCC, as genotypes were imputed in the T1DGC cohort. 
b Estimated from 1930 cases/3342 controls genotyped directly by the WTCCC. 



variant, potentially biasing any test towards rejecting colocali- 
zation. Further, the null hypothesis implicitly tested by condi- 
tional methods, namely that the most associated SNP in one 
data set can completely explain the genetic association in a 
region in a different data set, is not the same as the null hy- 
pothesis that is likely to be of scientific interest, namely that 
the same variants underlie both traits. We do not replicate 
here any of the putative T1D colocalizations identified by 
these studies. Of the three colocalized loci identified by Nica 
et al. (24), we excluded the HLA locus from study and the 
other two do not appear to be monocyte eQTLs. We also 
rejected colocalization of T1D with monocyte expression of 
the genes listed by Dubois et al. (1) in these regions: 
TMEM116 and ALDH2 in 12q24.2 (?=lx 10" 23 and P = 
1 x 10" 20 , respectively) and TLR8 inXp22.2 (P=3 x 10" 4 ). 

One of the most striking features of our results is that the 
expression of relatively few genes appears consistent with 
colocalized T1D signals. Colocalization can be excluded for 
39 out of 60 probes (P < 0.0008), and none of these is 
among our online list of the 36 most likely T1D candidate 
genes in 35 regions (6) (http://wwww.tldbase.org/page/ 
Regions). Among the remaining 21 probes which were con- 
sistent with colocalization (P > 0.0008, Table 1), the 
.P-values appear to be skewed towards smaller values than 
might be expected by chance if all were genuinely colocalized. 
This may reflect the method by which eight of these P-values 
were calculated (the minimum of P-values from three non- 
independent tests in the case where colocalization testing 
was not possible in the CTS), a lack of power to detect depar- 
tures from the null, and/or a combination of subtle differences 
in LD between populations and the use of imputation, all of 
which would tend to create bias against the null. This empha- 
sizes that neither the existence of an eQTL nor lack of evi- 
dence against colocalization is enough to confirm a gene as 
causal for T1D. As with any colocalization analysis, these 
results are merely a tool that will enable prioritization of 
genes for detailed functional follow-up work. 

Partially as a consequence of these results, we are actively 
pursuing DEXI (16pl3.3) as a T1D candidate gene, finding, 
for example, that sequences in intron 19 of CLEC16A, 
where many of the most strongly T1D associated SNPs lie, 
interact physically with the promoter of DEXI, supporting 



our hypothesis that DEXI expression is a causal factor in 
T1D (25). Previously, CLEC16A itself was considered the 
favoured candidate gene (26). SUOX, one of a cluster of 
eQTL genes in the 12ql3.2 region, most of which can be com- 
prehensively rejected (P < 1 x 10 7 ), encodes sulphite 
oxidase and occurs in the mitochondrial membrane. Our 
results point to SUOX as a potential candidate for T1D, and, 
despite a lack of strong candidature, links can be drawn 
from the literature (27). Bisulphite and sulphite food preserva- 
tives have been associated with allergy and asthma and could 
affect the immune system and beta-cell functions, and our 
results point to SUOXsls a new candidate gene for T1D. Simi- 
larly, our results suggest that, in 19pl3.2, ICAM4 and/or 
ICAM3 may be worth consideration alongside the currently 
favoured TYK2 (5). 

A very strong candidate gene that our results point to is 
CLECL1, and this could be particularly informative since the 
region contains two other strong functional candidates, 
CD69 and CLEC2D. The CLECL1 eQTL is particularly 
strong, with associated SNPs explaining 71% of the variation 
in expression levels (Table 1 ) and each copy of the A allele of 
the most associated SNP, rs7970116, leading to a 0.6-fold in- 
crease in normalized expression (Supplementary Material, 
Table S2). CLECL1 encodes C-type lectin 1 or DCAL1 (den- 
dritic cell associated lectin 1), which functions as a receptor on 
myeloid cells, such as dendritic cells, to deliver costimulatory 
signals to subsets of T cells, and can affect the maturation of 
monocytes into dendritic cells (28). Coincident T1D and ex- 
pression signals for both I2pl3. 31/CLECL1 and 12ql3.2/ 
SUOX were also identified in a recent study in CD4+ lympho- 
cytes (29). A smaller expression data set was available (200 
samples), and the analysis was restricted to testing whether 
the disease-associated SNP showed association with the ex- 
pression, but our evidence which suggests cross-tissue coloca- 
lization adds further support to these genes as possible T1D 
candidates. 

Two regions with eQTL-associated genes which are consist- 
ent with colocalization, RSPH3 and EZR (or VIE) in 6q25.3, 
and CCDCl 01 in 16pll.2 (Table 1), contain previously iden- 
tified strong immune functional candidate genes, TAGAP and 
IL27, respectively. However, as neither of the candidate genes 
shows an eQTL in monocytes in the GHS data set, these data 
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cannot provide further information on their candidacy, and 
they remain strong candidates because of their major roles 
in the adaptive immune system. Expression of IL27, in particu- 
lar, has been correlated with an inflammatory bowel 
disease-associated SNP (30), emphasizing the tissue-specific 
nature of gene expression and the need to study a relevant 
tissue for any colocalization study. 

Another region of interest is on chromosome 13q32. We, 
and collaborators, recently identified an interferon regulatory 
factor 7 (IRF7)-driven inflammatory network (IDIN) in both 
rats and humans. This was controlled in humans by a locus 
on chromosome 13q32 which was also associated with T1D 
(12). The gene regulating this IDIN was shown to be 
GPR183, a G-protein-coupled receptor that controls B-cell mi- 
gration and for which the ligands, oxysterols, have recently 
been discovered (31-34). However, we could not demonstrate 
that the T1D association in the region was mediated by 
GPR183 expression because the likelihood for Plagnol 
et a/.'s (3) test was bimodal and we were not confident in 
the statistical asymptotics. GPR183 has a strong monocyte 
eQTL in the region, and with the newly developed test pre- 
sented here, we find no evidence against colocalization (P = 
0.13), supporting GPR183 as potentially causal for T1D in 
this region. Also interesting are AFF3, in 2ql 1.2, a region pre- 
viously associated with arthritis (18,21) and for which we 
present further evidence of association with T1D in this 
paper (Table 2) and SMARCE1 in 17q21.2. 

In one region, 18q22.2, we did not apply our test because a 
single SNP was sufficient to capture both the expression and 
T1D signals, a situation which is compatible with the null. 
CD226 encodes a cell surface receptor involved in adhesion, 
signalling and effector functions of lymphocytes and natural 
killer cells and is thought likely to be the causal gene in the 
18q22.2 interval, given that the same non-synonymous SNP 
(Gly307Ser, rs763361) is associated with multiple auto- 
immune diseases, including multiple sclerosis (MS) (35), 
and that anti-CD226 treatment can delay the onset of disease 
in an experimental model of MS (36). However, our results, 
which imply that T1D risk correlates with reduced CD226 
mRNA, suggest that the T1D association may also be 
mediated by the direct effect of Gly307Ser on the CD226 
protein (35), and it has been proposed previously that this 
variant could affect splicing of exon 7, which contains the 
SNP (26). Reduced CD226 mRNA could cause reduced cell 
activation on cross-linking, and these results highlight the im- 
portance of studying the expression and function of CD226 in 
purified monocytes and the interactions of monocytes with 
lymphocytes and other immune cells and the CD226 receptors, 
CD112 and CD155. 

As the adoption of dense, disease-specific genotyping 
chips such as Immunochip (37; Immunobase, http://www. 
immunobase.org) increases, it is possible that the situation of 
single SNPs explaining two trait signals may become more 
common, implying that any formal colocalization test would 
become redundant. This scenario would be consistent with a 
hypothesis of a single causal variant common to both traits. 
Alternatively, if there exist multiple common causal variants, 
then dense genotyping should enable the multiple SNPs 
required to describe the SNP-trait associations to be identified, 
and this test will be more widely applicable, with the caveat 



that all causal variants must operate in the same manner for 
the data to appear consistent with the null. Early indications 
are that both of these situations will arise, with multiple 
signal regions forming a substantial minority (38). 

All of our results will need follow-up work, and we have 
begun this with CD226. As a first step, we have confirmed 
the existence of the eQTL in monocytes, using the alternative 
approach of allele-specific expression and observed that the 
direction of allelic imbalance is the same as observed here, 
with reduced expression correlated with the T1D risk allele 
(see Supplementary Material). 

Our formal colocalization analysis of T1D and monocyte 
gene expression has identified genes that should be prioritized 
for follow-up work in regions associated with T1D, and 
excluded some genes as likely to be causal through their 
action in monocytes. However, as eQTLs can be tissue specif- 
ic, we cannot exclude the possibility that there is an alternative 
expression altering mechanism in another relevant tissue (such 
as CD4+ T cells) which would explain the T1D association. 
The test as designed could be applied to any pair of traits, 
and interesting future applications of our test include analysis 
of eQTL signals from different tissues or overlapping disease 
loci for T1D and other autoimmune diseases. A formal ana- 
lysis is particularly useful, we believe, because it allows the 
researcher to explicitly evaluate the strength of evidence 
against colocalization and to rank genes for follow-up. 
Further analysis with denser SNP genotyping chips, locus- 
specific allele-specific expression analyses and more inform- 
ative arrays (or even RNA sequencing) is now strongly 
justified, alongside investigations of which cellular phenotypes 
are altered by the CD226 causal variant, using flow cytometry 
as we have previously described for the T1D and MS locus 
IL2RA (8). 

MATERIALS AND METHODS 

Samples 

Gutenberg Health Study samples 

Subjects. A total of 1490 study participants of both sexes aged 
35-74 years were successively enrolled into the GHS, a white 
European, community-based, single centre, prospective cohort 
study in the Rhein-Main region in western mid-Germany (15). 
All subjects gave written informed consent. Ethical approval 
was given by the local ethics committee and by the local 
and federal data safety commissioners. 

Genotype Data. Genome-wide variability genotyping was per- 
formed using the Affymetrix Genome-Wide Human SNP 
Array 6.0 and the Genome-Wide Human SNP NspUStyl 5.0 
Assay kit. Genotypes were called using the Affymetrix 
Birdseed- V2 calling algorithm and quality control was per- 
formed using GenABEL (39). Sample and SNP exclusion cri- 
teria were as applied previously (15). Briefly, samples were 
excluded if the per-sample call rate fell below 97%, if the 
autosomal heterozygosity (false discovery rate, FDR < 0.01) 
was too high or if they duplicated or were closely related to 
another sample in the study. Relatedness between study parti- 
cipants was estimated by the identity-by-state (IBS) statistic. 
In each pair showing an estimated proportion of alleles 
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IBS > 0.95, the sample with the lower call rate was excluded 
from further analyses. Quality control was performed on 
900 392 SNPs. SNPs were excluded if the minor allele fre- 
quency (MAF) fell below 1%, if they deviated from Hardy - 
Weinberg equilibrium (HWE) (P < 10" 4 ) or if the per-SNP 
call rate fell below 98%; 675 350 SNPs were left for analysis. 

Expression Data. Separation of monocytes was conducted 
within 60 min of blood collection. Blood was collected using 
the Vacutainer CPT Cell Preparation Tube System (BD, Hei- 
delberg, Germany) and the blood samples were enriched in 
monocytes by negative selection, using RosetteSep Monocyte 
Enrichment Cocktail (StemCell Technologies, Vancouver, 
Canada). This cocktail contains antibodies directed against 
cell surface antigens on human haematopoietic cells (CD2, 
CD3, CD8, CD19, CD56, CD66b) and glycophorin A on red 
blood cells. Total RNA was extracted the same day using 
Trizol extraction and purification by silica-based columns. 
Genome-wide expression assessment was performed using 
the Illumina HT-12 v3 BeadChip. Raw intensities were nor- 
malized in R (23), using VST transformation and quantile nor- 
malization as implemented in the lumi package (40,41). 
Probes were included in analysis if expression was considered 
detected (Illumina detection P < 0.01) in at least 90% of 
samples. 

Cardiogenics Transcriptome Study Samples 
Subjects. A total of 917 patients and healthy individuals of 
European descent were recruited in five centres within the 
CTS. Healthy individuals were recruited in Cambridge (« = 
458; UK). CAD and MI patients (n = 459) were recruited in 
Leicester (n = 161; UK), Liibeck (n = 102; Germany), 
Regensburg (n = 122; Germany) and Paris (n = 74; France). 
The study was approved by the Institutional Ethical Commit- 
tee of each participating centre. We restricted analysis to 558 
non-diabetic samples who had genetic, phenotypic and expres- 
sion data available at the time of this study. 

RNA Extraction. Monocyte isolation and RNA extraction were 
performed separately in each centre according to standardized 
procedures. All RNA samples were subsequently sent to the 
Paris centre for amplification, whole-genome microarray 
gene expression profiling and bioinformatics analysis. 

Blood samples (30 ml) from fasting subjects were collected 
into EDTA, and monocytes were isolated by positive selection 
with CD 14 microbeads (Miltenyi) according to the manufac- 
turer's instructions. Monocyte purity was measured as the per- 
centage of CD14+ cells analysed by flow cytometry. Half of 
the isolated cell preparation was immediately used for RNA 
extraction. Isolated monocytes were lysed in Trizol, and 
RNA was extracted in chloroform and ethanol, washed in 
RNeasy columns and incubated with DNasel before extracting 
in RNase-free water (Qiagen). RNA was quantified by the 
Nanodrop method before being transferred to Paris on dry ice. 

Genotyping. EDTA anticoagulated venous blood samples were 
collected from all participants. Genomic DNA was extracted 
from peripheral blood monocytes by standard procedures 
(Qiagen) and genotyped at either the Wellcome Trust Sanger 
Institute on the Human 610 Quad Custom Array (594 398 



SNPs and 66 049 CNVs), or the SNP&SEQ Technology Plat- 
form at Uppsala University, using the Sentrix Human Custom 
1.2 M array (1 115 839 SNPs and 80 128 CNVs). Samples 
were excluded based on per-sample call rate, outlying auto- 
somal heterozygosity, non-European ancestry, duplication 
and being closely related to another sample in the study; 
802 samples were kept for eQTL analyses. SNPs were 
excluded if the MAF fell below 1% in cases or in controls, 
if they deviated from HWE (P < 10 ) in controls, if the 
per-SNP call rate fell below 95% in cases or controls on the 
two Illumina arrays or if the MAF in controls was significantly 
different between the two Illumina arrays {P < 10 ). 

Expression Data. Gene expression profiling was performed 
using the Illumina Human Ref-8 Sentrix Bead Chip arrays 
(Illumina Inc., San Diego, CA, USA) containing 24 516 
probes corresponding to 18 311 distinct genes and 2 1 793 
Ref Seq annotated transcripts. mRNA was amplified and la- 
belled using the Illumina Total Prep RNA Amplification Kit 
(Ambion, Inc., Austin, TX, USA). After hybridization, array 
images were scanned using the Illumina BeadArray Reader, 
and probe intensities were extracted using the Gene expression 
module (version 3.3.8) of the Illumina BeadStudio software 
(version 3.1.30). Raw intensities were processed using the 
lumi (40,41) and beadarray (42) packages in R (23). All 
array outliers were excluded and only arrays with high con- 
cordance in terms of gene expression measures (pairwise 
Spearman correlation coefficients within each cell type 
>0.85) were included in the analyses. After data quality 
control, 849 monocyte RNA samples were available for statis- 
tical analyses. We analysed 558 non-diabetic samples in the 
eQTL analyses. 

T1D Case and Control Samples 

The case and control samples have been described before 
(4,43). Samples come from the Wellcome Trust Case 
Control Consortium (WTCCC) and the Type 1 Diabetes 
Genetics Consortium (T1DGC) GWAS. The WTCCC 
samples consist of 1930 T1D cases and 3342 controls geno- 
typed on the Affymetrix 500 k chip, and the T1DGC 
samples consist of 3983 T1D cases and 3999 controls geno- 
typed on the Illumina 500 k chip. Sample and SNP exclusion 
criteria were as applied previously (4). All subjects were of 
self-reported white European ancestry, samples were excluded 
based on per-sample call rate, outlying autosomal heterozy- 
gosity, non-European ancestry, duplication and being closely 
related to another sample in the study. SNPs were excluded 
if the MAF fell below 1% in cases or controls, if they deviated 
from HWE (P < 5.7 x 10" 7 ) in controls or if the per-SNP call 
rate fell below 95%. 

Alternative derivation of colocalization test 

We developed an alternative to the asymptotic likelihood- 
based approach of Plagnol et al. (3). Our alternative approach 
was necessary because the likelihood can be bimodal and is 
rarely quadratic near a maximum (Supplementary Material, 
Fig. SI), rendering the applicability of asymptotic likelihood 
ratio test theory questionable. There are some parallels 
between the alternative derivation and Fieller's theorem 
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(44), which allows for disjoint confidence intervals when like- 
lihoods are multimodal. 

We have two traits measured in independent cohorts Yi and 
Y 2 . Assume these traits are regressed (using linear or general- 
ized linear regression, as appropriate) on genotypes from the 
same set of p explanatory SNPs, X, producing maximum- 
likelihood estimates b\ and b 2 of regression coefficients j8i 
and /3 2 with variance -covariance matrices V\ and V 2 , respect- 
ively. Since sample sizes are very large, the combined likeli- 
hood may be closely approximated by a Gaussian likelihood 
for b\ and b 2 , assuming V\ and V 2 are known and that 
cov(Z?i, b 2 ) = 0. 

We assume equal LD in the two cohorts, i.e. that the rela- 
tionship between the causal variant and the genotyped SNPs 
associated with either trait does not differ between cohorts. 
Then, under the null hypothesis of colocalization, /3i oc fi 2 , 
i.e. (8i = (1/17)^2 = P (3). From Fieller's theorem (44), we 
may derive the \ 2 statistic: 

X 2 = u T V~ l u 

where u = b\ - (1/tj)^ 2 and V= V x + (l/r) 2 )V 2 . 

Note that this test statistic turns out to be identical to that 
proposed by Plagnol et al. (3) (see Supplementary Material). 
If 17 were known, X 2 would have a \ 2 distribution on p 
degrees of freedom. The difficulty arises because it is not 
known, and we must replace it by its maximum-likelihood es- 
timate, 7] (which also minimizes X 2 ). The asymptotic likeli- 
hood theory advanced by Plagnol et al. suggests that this 
results in min(X 2 ) having a x' distribution on p — 1 degrees 
of freedom, but this requires the log-likelihood for 17 to be 
well behaved (i.e. near quadratic) and this is not always the 
case; indeed, it is often bimodal. It is this behaviour of the 
likelihood for the ratio of regression coefficients which has 
given rise to the extended literature on Fieller's theorem. 

Instead, we use a posterior predictive P-value, first proposed 
by Rubin (16) and further developed by Meng (17) to allow for 
nuisance parameters (17 in our case). We begin by reparameter- 
izing the problem in terms of 8 = tan - 1 17. Then, replacing X 2 
by 

^^(^-tan-^'^^-tarT^)' 
the posterior predictive P-value is defined as 

[ T*(8)V(8)d8, 
Jo 

where T{6) is the P- value associated with T{6), and V(8) is 
the posterior distribution of 8, which we estimate assuming 
flat priors for 8 and /3 (see Supplementary Material). 

We also estimate a 95% credible interval for 17 by choosing 
values of 17 which have equal probability such that the area 
included between these limits is 95% of the total area under 
the posterior. Note that the credible interval may be disjoint 
in the case of a bimodal likelihood, as with Fieller's confi- 
dence intervals. In all cases where the credible interval is dis- 
joint, the two peaks of the bimodal likelihood are either side of 



zero, and therefore both positive and negative values of 17 are 
supported. 



T1D association testing of AFF3/2qll.2 

Previously, we found an association between T1D and AFF3/ 
2qll.2 in a GWAS meta-analysis (4). The most associated 
SNP in the region was rs4851256, but we could not design a 
TaqMan assay to genotype the replication resource. In the 
present study, a proxy was selected, rs4851253, which has 
r 2 = 1, with rs4851256 in the HapMap CEU samples. 

The replication case and control samples and family 
samples have been described previously (4). Case and 
control samples were analysed using logistic regression, 
adjusting for 12 broad geographical regions within Great 
Britain (southwestern, southern, southeastern, London, 
eastern, Wales, Midlands, North Midlands, northwestern, 
East and West Riding, northern and Scotland) to exclude the 
possibility of confounding by geography. These regions cor- 
respond to the place of collection for case and control subjects. 
We performed a 1 degree of freedom (df) likelihood ratio test 
to determine whether a 1 df multiplicative allelic effects 
model or a 2 df genotype effects model (no specific mode of 
inheritance assumed) was more appropriate. We assumed a 
multiplicative allelic effects model because it was not signifi- 
cantly different from the genotype model for rs4851253. 

The replication family samples were analysed using the 
transmission disequilibrium test and conditional logistic re- 
gression. The method proposed by Mantel (45) was used to 
combine the scores from replication case, control and family 
samples. However, as the SNP genotyped in the replication 
was a proxy SNP for rs4851256, we used Fisher's combined 
probability test to produce an overall statistic for association 
with T1D summarizing evidence in the GWAS and replication 
samples combined. 



eQTL identification 

We first used genome-wide genotype and fresh monocyte ex- 
pression data from 1370 non-diabetic subjects from the GHS 
(15) to generate a comprehensive map of both cis and trans 
monocyte eQTL patterns across known T1D loci listed in 
TIDBase (6) (Supplementary Material, Table SI). We con- 
ducted T1D locus- wide association testing of every probe in 
the GHS data set against every genotyped SNP which lay 
within the 19 Mb which comprise the T1D regions, using 
the snpMatrix (46) package in R (23). We excluded probes 
which were not determined to be expressed significantly 
above background in 90% of samples; annotated as 'bad' 
quality according to detailed analysis of Illumina expression 
arrays (47) or overlapped SNPs which were validated in 
dbSNP version 131 and, if frequency information was avail- 
able, not monomorphic in Europeans. We defined significant 
associations as those which had P < 10 -8 for cis effects 
(SNP < 5 Mb from probe) or P < 10" 10 for trans (SNP > 
5 Mb from probe or located on different chromosome) 
effects. These thresholds are two orders of magnitude lower 
than those established in the eQTL literature, as we are study- 
ing < 1% of the genome and because our aim here is to find 
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results worthy of follow-up, rather than to identify eQTLs 
definitively. 



Application of colocalization test 

Expression and disease status are rarely available in a single 
large cohort; the T1D GWAS and the GHS have been 
carried out using samples from different individuals. Instead, 
we test whether the coefficients from two separate regressions 
(expression and T1D against the same SNPs) may be consid- 
ered proportional. T1D cases and controls came from two 
GWAS, which had used different genome-wide SNP chips: 
the WTCCC (43) used the Affymetrix 500 k chip and 
the T1DGC (4) used the Illumina 550 k chip. In contrast, the 
GHS used the more recent Affymetrix 1 M chip and the 
CTS a combination of the Illumina 670 k and 1.2 M chips. 
Consequently, we need to take special care with imputation, 
which is now commonly used to expand the number of 
SNPs tested. When SNPs are imperfectly imputed, regression 
coefficients are biased towards the null. Thus, if the same SNP 
is imputed with different efficiency in the T1D and expression 
data sets, this could lead to perhaps false evidence against the 
null hypothesis of colocalization. On the other hand, restric- 
tion to SNPs directly genotyped in both data sets may not be 
appropriate when different chips have been used, as this can 
restrict the number of SNPs available to test and fail to 
capture the individual association signals adequately. We 
chose to use imputed SNPs across chips, using IMPUTE v2 

(48) , but restricted analysis to well-imputed SNPs (info > 
0.8), likely to induce only small bias in coefficients. 

We, therefore, conducted an initial colocalization analysis 
on these 60 genotype-expression associations, using directly 
genotyped and well-imputed SNPs in the GHS and WTCCC, 
both of which used Affymetrix chips. We attempted to 
expand the analysis into the CTS when data for the same 
probe were available; there was evidence of probe -genotype 
association in the region in the CTS (P < 0.0008; Bonferroni 
correction of a = 0.05 for 60 tests) and there was no strong 
evidence against colocalization of the GHS and CTS signals 
(P > 0.01). For such probes (50/60 occasions), we compared 
expression and T1D signals in the CTS and T1DGC data, 
using directly genotyped and well-imputed SNPs (both 
studies used Illumina chips). We combined the P-values 
from the two tests, using Fisher's method. When it was not 
possible to extend the analysis into the CTS, we used imput- 
ation to compare signals from the GHS with the T1DGC, 
the WTCCC and the combined T1D data in turn. In this 
case, the tests are not independent, and we used the 
minimum observed P-value as an overall measure of 
significance. 

For each putative eQTL, we extended the T1D region to 
include a 0.1 cM window surrounding the best eQTL SNP. 
Then, for each expression-TID data set under test, we took 
all SNPs in the extended window, and used lasso regression 

(49) to determine a subset which best explained the association 
of both the probe and T1D, according to the Bayesian informa- 
tion criterion. We tried two approaches: lasso on the expres- 
sion data set first, followed by the T1D data set including all 
SNPs chosen in the first stage, or vice versa. The model 



with the smallest combined Bayesian information criterion 
was preferred. 

Genotypes were imputed using CEU data from HapMap 
version 2, release 24, using IMPUTE v2 (48). All analysis 
was carried out using R (23) and the packages snpMatrix 
(46) for initial association testing of expression measures 
and glmnet (50) for lasso regression. 

URLS 

1000 Genomes, http://www.1000genomes.org; Cardiogenics 
Transcriptome Study, http://www.cardiogenics.eu; CRAN, 
http://cran.r-project.org; HapMap, http://www.hapmap.org; 
TIDBase, http://www.tldbase.org; Immunobase, http://www. 
immunobase.org. 

SUPPLEMENTARY MATERIAL 

Supplementary Material is available at HMG online. 
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